113 research outputs found

    Informatics expertise to support life and health sciences research and industry

    Computing Infrastructure and Informatics to Support Life Sciences R&D, Therapeutics, Diagnostics and Economic Development Panel. Interdisciplinary collaboration between computational sciences and life/health sciences is a hallmark of the MU Informatics Institute (MUII) and its new Informatics Ph.D. program. The Institute was established to foster synergy and interdisciplinary research applications in animal, plant, human health, geospatial and microbial sciences. Creative faculty and modern computation-based research facilities combine to enable groundbreaking collaborative research that relies heavily on informatics tools and expertise. In this talk, I will briefly introduce the informatics expertise of MUII core faculty in supporting experimental scientists' R&D activities with commercialization potential, using an example scenario in personalized medicine. Six signature research areas form the underpinning components: (1) high-throughput sequence assembly and analysis, (2) structural bioinformatics - prediction, retrievals, and interactions, (3) large-scale and high-throughput phenotype analysis, (4) data mining and knowledge discovery from large-scale omics databases and electronic health records, (5) visualization and parallelism of informatics data, and (6) geospatial informatics.

    Refined repetitive sequence searches utilizing a fast hash function and cross species information retrievals

    BACKGROUND: Searching for small tandem/dispersed repetitive DNA sequences streamlines many biomedical research processes. For instance, whole genomic array analysis in yeast has revealed 22 PHO-regulated genes. The promoter regions of all but one of them contain at least one of the two core Pho4p binding sites, CACGTG and CACGTT. In humans, microsatellites play a role in a number of rare neurodegenerative diseases such as spinocerebellar ataxia type 1 (SCA1). SCA1 is a hereditary neurodegenerative disease caused by an expanded CAG repeat in the coding sequence of the gene. In bacterial pathogens, microsatellites are proposed to regulate expression of some virulence factors. For example, bacteria commonly generate intra-strain diversity through phase variation, which is strongly associated with virulence determinants. A recent analysis of the complete sequences of the Helicobacter pylori strains 26695 and J99 has identified 46 putative phase-variable genes among the two genomes through their association with homopolymeric tracts and dinucleotide repeats. Life scientists are increasingly interested in studying the function of small sequences of DNA. However, current search algorithms often generate thousands of matches – most of which are irrelevant to the researcher. RESULTS: We present our hash function as well as our search algorithm to locate small sequences of DNA within multiple genomes. Our system applies information retrieval algorithms to discover knowledge of cross-species conservation of repeat sequences. We discuss our incorporation of the Gene Ontology (GO) database into these algorithms. We conduct an exhaustive time analysis of our system for various repetitive sequence lengths. For instance, a search for eight bases of sequence within 3.224 GBases on 49 different chromosomes takes 1.147 seconds on average. To illustrate the relevance of the search results, we conduct a search with and without added annotation terms for the yeast Pho4p binding sites, CACGTG and CACGTT. Also, a cross-species search is presented to illustrate how potential hidden correlations in genomic data can be quickly discerned. The findings in one species are used as a catalyst to discover something new in another species. These experiments also demonstrate that our system performs well while searching multiple genomes – without the main memory constraints present in other systems. CONCLUSION: We present a time-efficient algorithm to locate small segments of DNA and concurrently search the annotation data accompanying the sequence. Genome-wide searches for short sequences often return hundreds of hits. Our experiments show that subsequently searching the annotation data can refine and focus the results for the user. Our algorithms are also space-efficient in terms of main memory requirements. Source code is available upon request.
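The core idea of hash-based short-sequence lookup can be sketched in a few lines. This is a minimal illustration assuming a simple 2-bit-per-base encoding and an in-memory index; it is not the paper's actual hash function, and the example genome is invented.

```python
def encode(seq):
    """Pack a DNA string into an integer, 2 bits per base."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    h = 0
    for base in seq:
        h = (h << 2) | code[base]
    return h

def index_genome(genome, k):
    """Map each k-mer hash to the list of positions where it occurs."""
    table = {}
    for i in range(len(genome) - k + 1):
        table.setdefault(encode(genome[i:i + k]), []).append(i)
    return table

def find(table, pattern):
    """Look up all occurrences of a short pattern in constant expected time."""
    return table.get(encode(pattern), [])

genome = "TTCACGTGAACACGTTGG"          # toy sequence, not real data
table = index_genome(genome, 6)
print(find(table, "CACGTG"))           # positions of one Pho4p core site
print(find(table, "CACGTT"))           # positions of the other core site
```

Because lookups hit a precomputed table rather than rescanning the sequence, query time is independent of genome length, which is the property the abstract's timing results rely on.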

    Genealogy browser: A framework for the management and analysis of genotypic and phenotypic plant data [abstract]

    Abstract only available. Researchers who study plants need an efficient way to manage data that provides insight into genealogy and its effect on genotypic and phenotypic expression. Traditional pencil-and-paper methods, arising from the need to collect data in the field, prove time consuming and error prone when tracking plant lineage. Genealogy Browser presents an effective method for plant management. By providing a web interface that collects information about plant families and relationships, the application provides a framework for data analysis. This setting allows researchers to collaborate on common plants and to become aware of traits seen in other research groups. Furthermore, the application analyzes current gene and phenotype information to show where plants deviate from the expected, which will be of great importance to researchers, especially as they study crops in varying climate settings. Genealogy Browser is vital for researchers who need to manage thousands of plants while obtaining useful information that sheds light on genotypic and phenotypic trends. National Science Foundation; Shumaker Endowment for Bioinformatics

    A fast SCOP fold classification system using content-based E-Predict algorithm

    BACKGROUND: Domain experts manually construct the Structural Classification of Proteins (SCOP) database to categorize and compare protein structures. Even though using the SCOP database is believed to be more reliable than classification results from other methods, it is labor intensive. To mimic human classification processes, we develop an automatic SCOP fold classification system to assign possible known SCOP folds and recognize novel folds for newly-discovered proteins. RESULTS: With a sufficient amount of ground truth data, our system is able to assign the known folds for newly-discovered proteins in the latest SCOP v1.69 release with 92.17% accuracy. Our system also recognizes novel folds with 89.27% accuracy using 10-fold cross-validation. The average response time for proteins with 500 and 1409 amino acids to complete the classification process is 4.1 and 17.4 seconds, respectively. By comparison with several structural alignment algorithms, our approach outperforms previous methods on both classification accuracy and efficiency. CONCLUSION: In this paper, we build an advanced, non-parametric classifier to accelerate the manual classification processes of SCOP. With satisfactory ground truth data from the SCOP database, our approach identifies relevant domain knowledge and yields reasonably accurate classifications. Our system is publicly accessible at
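The non-parametric "assign known fold or flag novel" decision can be illustrated with a generic nearest-neighbour sketch. This is not the actual E-Predict algorithm, and the two-dimensional feature vectors and cutoff value are invented for illustration.

```python
import math

def classify(query, labelled, novel_cutoff):
    """Assign the fold of the closest ground-truth vector, or report a
    putative novel fold when nothing in the training set is close enough."""
    best_fold, best_dist = None, float("inf")
    for features, fold in labelled:
        d = math.dist(query, features)      # Euclidean distance
        if d < best_dist:
            best_fold, best_dist = fold, d
    return best_fold if best_dist <= novel_cutoff else "novel"

# Hypothetical content-derived feature vectors with known SCOP folds
known = [((0.9, 0.1), "all-alpha"), ((0.1, 0.8), "all-beta")]
print(classify((0.85, 0.15), known, novel_cutoff=0.5))  # "all-alpha"
print(classify((0.5, 0.5), known, novel_cutoff=0.2))    # "novel"
```

The novelty cutoff is what lets a non-parametric classifier refuse to force every new protein into an existing fold, matching the abstract's separate accuracy figures for known and novel folds.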

    On Archiving and Retrieval of Sequential Images From Tomographic Databases in PACS

    In the picture archiving and communication systems (PACS) used in modern hospitals, the current practice is to retrieve images based on keyword search, which returns a complete set of images from the same scan. Both diagnostically useful and negligible images in the image databases are retrieved and browsed by the physicians. In addition to the text-based search query method, queries based on image contents and image examples have been developed and integrated into existing PACS systems. Most of the content-based image retrieval (CBIR) systems for medical image databases are designed to retrieve images individually. However, in a database of tomographic images, it is often diagnostically more useful to simultaneously retrieve multiple images that are closely related for various reasons, such as physiological contiguousness. For example, high resolution computed tomography (HRCT) images are taken in a series of cross-sectional slices of the human body. Typically, several slices are relevant for making a diagnosis, requiring a PACS system that can retrieve a contiguous sequence of slices. In this paper, we present an extension to our physician-in-the-loop CBIR system, which allows our algorithms to automatically determine the number of adjoining images to retain after certain key images are identified by the physician. Only the key images, so identified by the physician, and the other adjoining images that cohere with the key images are kept on-line for fast retrieval; the rest of the images can be discarded if so desired. This results in a large reduction in the amount of storage needed for fast retrieval.
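The "retain adjoining slices that cohere with a key image" step can be sketched as a simple outward growth from the physician-selected slice. The similarity scores and cutoff here are stand-ins for the paper's actual coherence criterion.

```python
def coherent_range(similarities, key, cutoff):
    """Given similarities[i] comparing slice i with slice i+1, return the
    (lo, hi) inclusive range of slice indices to keep around a key slice:
    grow outward while adjacent-slice similarity stays above the cutoff."""
    lo = key
    while lo > 0 and similarities[lo - 1] >= cutoff:
        lo -= 1
    hi = key
    while hi < len(similarities) and similarities[hi] >= cutoff:
        hi += 1
    return lo, hi

# 6 toy slices -> 5 adjacent-pair similarity scores (invented values)
sims = [0.2, 0.9, 0.95, 0.4, 0.1]
print(coherent_range(sims, key=2, cutoff=0.8))  # keep slices 1 through 3
```

Everything outside the returned range can be moved to slower storage or discarded, which is the source of the storage reduction described above.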

    The News Crawler: A Big Data Approach to Local Information Ecosystems

    In the past 20 years, Silicon Valley’s platforms and opaque algorithms have increasingly influenced civic discourse, helping Facebook, Twitter, and others extract and consolidate the revenues generated. That trend has reduced the profitability of local news organizations, but not the importance of locally created news reporting in residents’ day-to-day lives. The disruption of the economics and distribution of news has reduced, scattered, and diversified local news sources (digital-first newspapers, digital-only newsrooms, and television and radio broadcasters publishing online), making it difficult to inventory and understand the information health of communities, individually and in aggregate. Analysis of this national trend is often based on the geolocation of known news outlets as a proxy for community coverage. This measure does not accurately estimate the quality, scale, or diversity of topics provided to the community. This project is developing a scalable, semi-automated approach to describe digital news content along journalism-quality-focused standards. We propose identifying representative corpora and applying machine learning and natural language processing to estimate the extent to which news articles engage in multiple journalistic dimensions, including geographic relevancy, critical information needs, and equity of coverage.

    Accelerating large-scale protein structure alignments with graphics processing units

    BACKGROUND: Large-scale protein structure alignment, an indispensable tool in structural bioinformatics, poses a tremendous challenge to computational resources. To ensure structure alignment accuracy and efficiency, efforts have been made to parallelize traditional alignment algorithms in grid environments. However, these solutions are costly and of limited accessibility. Others trade alignment quality for speedup by using high-level characteristics of structure fragments for structure comparisons. FINDINGS: We present ppsAlign, a parallel protein structure alignment framework designed and optimized to exploit the parallelism of graphics processing units (GPUs). As a general-purpose GPU platform, ppsAlign can incorporate many concurrent methods, such as TM-align and Fr-TM-align, into the parallelized algorithm design. We evaluated ppsAlign on an NVIDIA Tesla C2050 GPU card and compared it with existing software solutions running on an AMD dual-core CPU. We observed a 36-fold speedup over TM-align, a 65-fold speedup over Fr-TM-align, and a 40-fold speedup over MAMMOTH. CONCLUSIONS: ppsAlign is a high-performance protein structure alignment tool designed to tackle the computational complexity of protein structural data. The solution presented in this paper allows large-scale structure comparisons to be performed using the massively parallel computing power of GPUs.
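The data-parallel pattern behind large-scale alignment is that every query-target pair can be scored independently. The sketch below shows that pattern with CPU threads for brevity; ppsAlign itself runs on GPUs, and `toy_score` is a deliberately trivial stand-in for a real alignment kernel such as TM-align.

```python
from concurrent.futures import ThreadPoolExecutor

def toy_score(pair):
    """Count matching positions between two sequences; a stand-in
    for a real (and far more expensive) structure alignment score."""
    a, b = pair
    return sum(1 for x, y in zip(a, b) if x == y)

def all_vs_all(query, database):
    """Score the query against every database entry in parallel;
    each pair is independent, so the work maps cleanly onto many cores."""
    pairs = [(query, target) for target in database]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(toy_score, pairs))

db = ["MKTAY", "MKTAA", "GGGGG"]     # invented toy sequences
print(all_vs_all("MKTAY", db))       # [5, 4, 0]
```

On a GPU the same independence lets thousands of pair comparisons run concurrently, which is where the reported speedups come from.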

    Design and evaluation of a personal digital assistant-based alerting service for clinicians

    Purpose: This study describes the system architecture and user acceptance of a suite of programs that deliver information about newly updated library resources to clinicians’ personal digital assistants (PDAs). Description: Participants received headlines delivered to their PDAs alerting them to new books, National Guideline Clearinghouse guidelines, Cochrane Reviews, and National Institutes of Health (NIH) Clinical Alerts, as well as updated content in UpToDate, Harrison's Online, Scientific American Medicine, and Clinical Evidence. Participants could request additional information for any of the headlines, and the information was delivered via email during their next synchronization. Participants completed a survey at the conclusion of the study to gauge their opinions about the service. Results/Outcome: Of the 816 headlines delivered to the 16 study participants’ PDAs during the project, Scientific American Medicine generated the highest proportion of headline requests at 35%. Most users of the PDA Alerts software reported that they learned about new medical developments sooner than they otherwise would have, and half reported that they learned about developments that they would not have heard about at all. While some users liked the PDA platform for receiving headlines, it seemed that a Web database that allowed tailored searches and alerts could be configured to satisfy both PDA-oriented and email-oriented users. Includes bibliographical references

    Efficient organization of genetic data for easier statistical analysis [abstract]

    Abstract only available. When biological experiments are run, specifically those examining genetics, huge amounts of data are produced. This data is hard to organize and even harder to analyze without organization. When faced with a vast amount of data, statistical patterns are hard to observe, as associations between different types of data are nearly impossible to see. A relational database was designed to handle this problem. The database organizes data taken by large-scale, long-term genetic experiments into linked categories, and uses an interface designed so that many approaches to analyzing the data are possible. The database is unique among other forms of organization in that it can be readily applied to these different approaches while also handling subject-specific phenotypic information. This information is often left out of other database structures for genetic experiments. To show an application of the database, the structure is being applied to a long-term maize genetics experiment. Using PHP scripting to insert data taken in the field into a MySQL system, the database is being used to create a search engine for plant geneticists to obtain genotypic information from phenotypic observations. However, the structure is limited neither to plant genetics nor to search engines, but can be applied to a multitude of bioinformatic and statistical studies of experimental data.
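A relational layout linking pedigree, genotype, and phenotype records can be sketched with an in-memory SQLite database. The table and column names below are hypothetical illustrations, not the project's actual MySQL schema, and the maize-style data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE plant (
    id INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES plant(id)   -- pedigree link
);
CREATE TABLE genotype (
    plant_id INTEGER REFERENCES plant(id),
    locus TEXT, allele TEXT
);
CREATE TABLE phenotype (
    plant_id INTEGER REFERENCES plant(id),
    trait TEXT, value REAL
);
""")
conn.execute("INSERT INTO plant VALUES (1, NULL), (2, 1)")
conn.execute("INSERT INTO genotype VALUES (2, 'bz1', 'Bz1')")
conn.execute("INSERT INTO phenotype VALUES (2, 'kernel_color', 1.0)")

# Phenotype-to-genotype search: which alleles co-occur with a trait?
rows = conn.execute("""
    SELECT g.locus, g.allele FROM phenotype p
    JOIN genotype g ON g.plant_id = p.plant_id
    WHERE p.trait = 'kernel_color'
""").fetchall()
print(rows)  # [('bz1', 'Bz1')]
```

Keeping phenotype observations in their own table linked by `plant_id` is what lets the same schema serve a search engine, statistical exports, or pedigree queries without restructuring the data.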